In this tutorial, we will learn how to set up a R Markdown document for reproducible statistical analysis/modeling, primarily using functions and syntax from the KnitR package. On Tuesday, we will ensure that everyone has downloaded the right data and packages and go over code chunk setup and begin analysis?. On Thursday, we will learn different ways to source and code analysis chunks, how to ensure your analysis is reproducibly random, and how to efficiently run computationally-intensive analysis.
We will learn how to dynamically connect data gathering and source code to markdown documents, setup pipelines to rerun analysis and present results whenever compiled, and keep markdown documents up to date and reflecting changes made to the data or analysis in source code files.
Coding is hard. It takes a long time, and you’ll likely have to go back to the same analysis multiple times to tweak and change things. It is important to setup a good workflow early to ensure you’re not scrambling to make your research reproducible at the end of the process. You’ll be saving yourself a lot of time if your analysis is reproducible and your results presented every time you knit your document.
Through this tutorial, students will learn to:
Make sure you have the following packages: “knitr”, “rmarkdown”, “bookdown”, “formattable”, “kableExtra”, “dplyr”, “magrittr”, “prettydoc”, “htmltools”, “knitcitations”, “bibtex”, “devtools”, “tidyverse” Not sure we’ll actually need all this. Below are the code chunks for loading packages and setting up the document from
###~~~
# Load R packages
###~~~
#Create a vector w/ the required R packages
# --> If you have a new dependency, don't forget to add it in this vector
pkg <- c("knitr", "rmarkdown", "bookdown", "formattable", "kableExtra", "dplyr", "magrittr", "prettydoc", "htmltools", "knitcitations", "bibtex", "devtools", "tidyverse")
##~~~
#2. Check if pkg are already installed on your computer
##~~~
print("Check if packages are installed")
#This line below outputs a list of packages that are not installed. Which ones of the packages above are installed in computer, print packages that are not installed. Should print 0.
new.pkg <- pkg[!(pkg %in% installed.packages())]
##~~~
#3. Install missing packages
##~~~
# Use an if/else statement to check whether packages have to be installed
# WARNING: If your target R package is not deposited on CRAN then you need to adjust code/function
if(length(new.pkg) > 0){
print(paste("Install missing package(s):", new.pkg, sep=' '))
install.packages(new.pkg, dependencies = TRUE)
}else{
print("All packages are already installed!")
}
##~~~
#4. Load all required packages
##~~~
print("Load packages and return status")
#Here we use the sapply() function to require all the packages
# To know more about the function type ?sapply() in R console
# Just an easier way to load all packages
sapply(pkg, require, character.only = TRUE)
## Loading required package: knitr
## Loading required package: rmarkdown
## Loading required package: bookdown
## Loading required package: formattable
## Loading required package: kableExtra
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: magrittr
## Loading required package: prettydoc
## Loading required package: htmltools
## Loading required package: knitcitations
## Loading required package: bibtex
## Loading required package: devtools
## Loading required package: usethis
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::extract() masks magrittr::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::set_names() masks magrittr::set_names()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# debug = right-pointing arrow next to code will load in console so you can go line by line and debug. pressing this also seemed to resolve the CRAN needs a mirror error
# Generate BibTex citation file for all loaded R packages
# used to produce report Notice the syntax used here to
# call the function
knitr::write_bib(.packages(), file = "packages.bib")
# The .packages() function returns invisibly the names of all packages loaded in the current R session (to see the output, use .packages(all.available = TRUE)). This ensures that all packages used in your code will have their citation entries written to the .bib file. This packages is then used in the Appendix to cite the code used
Download data from google drive. For this chapter, we will be working with a couple of datasets. Mike is working on consolidating some different exploratory analyses he did at different times over the last year. He wants to get them in the same place and make sure that they are reproducible. We’ll start with a dataset of “encounter rates” calculated from eBird data, which are calculated as the percentage of checklists that include a particular species in a given area (Schuetz and Johnston, 2019). This data has been analyzed for the United States by state, and Mike is getting ready to do encounter rates for species in Sonora, Mexico that are also found in Idaho. Before digging into the eBird data for Sonora, Mike wanted to get to know the dataset for Idaho so that he knows what species to look for in the Sonora data. Let’s get that data ready.
#Load Data
library(tidyverse)
#load csv file of data
state_enc_rate <- read.csv("state.level.enc.rate.csv")
view(state_enc_rate)
#Get rid of uneccessary columns
state_enc_rate_clean <- state_enc_rate %>%
select(
-query.volume.state #not needed for analysis
)
view(state_enc_rate_clean)
#filter to just Idaho
state_enc_rate_clean_filtered <- state_enc_rate_clean %>%
filter(
state == "Idaho",
)
view(state_enc_rate_clean_filtered)
#get rid of species with no encounter rate
ID_enc_rate <- state_enc_rate_clean_filtered %>%
filter(encounter.state.normalized!=0
)
#get ride of encounter rates below 50 to focus on "commonly encou
ID_enc_rate <- ID_enc_rate %>%
filter(encounter.state.normalized > 50)
view(ID_enc_rate)
Setup Chunk Options:
### Chunk options: see http://yihui.name/knitr/options/ ###
### Text output
opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, include = TRUE)
## Code formatting
opts_chunk$set(tidy = TRUE, tidy.opts = list(blank = FALSE, width.cutoff = 60),
highlight = TRUE)
## Code caching
opts_chunk$set(cache = 2, cache.path = "cache/")
## Plot output The first dev is the master for the output
## document
opts_chunk$set(fig.path = "Figures_MS/", dev = c("png", "pdf"),
dpi = 300)
## Figure positioning
opts_chunk$set(fig.pos = "H")
Although dynamically linking source code to markdown documents is the main goal of this chapter, sometimes it makes sense to include short code chunks directly into your markdown document. There are a variety of customizable options on how the code and its output are display in the final knitted document. We use code chunk arguments from the knitr package to choose the parts of our code chunks that are outputted in the final html document.
#Load Data
First we are going to to do a little bit more data transformation to make a couple of objects of the Idaho encounter rate data grouped taxonomically for some descriptive statistics we will use later.
# Pull passerines from ID encounter rates list
passerine_enc_rate <- ID_enc_rate %>%
filter(common.name %in% c("American robin", "bank swallow",
"black-billed magpie", "black-headed grosbeak", "Brewer's blackbird",
"Bullock's oriole", "Cassin's finch", "Cassin's vireo",
"cedar waxwing", "cliff swallow", "dark-eyed junco",
"dusky flycatcher", "European starling", "evening grosbeak",
"Hammond's flycatcher", "house finch", "lazuli bunting",
"MacGillivray's warbler", "mountain chickadee", "northern rough-winged swallow",
"olive-sided flycatcher", "pine siskin", "red crossbill",
"red-winged blackbird", "sage thrasher", "song sparrow",
"western kingbird", "western tanager", "western wood-pewee",
"willow flycatcher", "yellow warbler"))
waterfowl_enc_rate <- ID_enc_rate %>%
filter(common.name %in% c("American coot", "American wigeon",
"cinnamon teal", "common goldeneye", "common merganser",
"mallard", "ring-necked duck", "western grebe"))
What do the chunk commands do:
We use code chunk arguments from the knitr package to choose the parts of our code chunks that are outputted in the final html document.
Code chunk arguments:
Take 5 minutes to discuss with the person sitting next to you. Given the same chunk of analysis, how would you set up your chunk arguments when the document is being sent to your advisor for review versus being sent to a journal for publication? Then share with the class.
For eval and echo, you can tell knitr which expressions in the chunk to include using numerical vectors. For example, eval = 1:2 will only include the 1st 2 expressions of code in the output document.
Sometimes you may want to have R code or output appear inline with the rest of the markdown document. These can be static or dynamic.
ID_enc_rate <- state_enc_rate_clean_filtered %>% filter(encounter.state.normalized!=0 )We will go over three ways to include analysis in your R Markdown document: locally sourced modular analysis, url-sourced modular analysis, and using a code chunk. We will also go over the scenarios in which you might use each method.
You may have previously written code locally stored on your computer that you want to pull for an analysis. To do so, we will save that code in its own R file and use the source command from knitR to call it. This is most helpful in the early stages of your research, as this is not the best practice for ensuring reproducible science.
Use the argument include = FALSE to run the analysis and produce objects that can still be used by other code chunks but will not show up in the output document.
make a separate r file that calculates means of passerine encounter rates, then source it here.
Once you are further along in your research, you may want to make sure your file in publicly accessible to ensure reproducibility. You can host your file in a GitHub repository and use the source_url command from the devtools package to call it.
We have set up a pre-made GitHub file for this section.
source from pre-made github file. pre-made file on shorebird encounter rate means
Lastly, the simplest way to include analysis is to analyze the data within a code chunk. This is the best method for reproducibility as all your analysis is in one document, making it easier for other to reuse your code.
For this tutorial, we will create a code chunk with a simple analysis that will generate a figure comparing encounter rates between Idaho and Sonora.
What code chunk arguments should we use if we want to figure to show up in the output document but not the code?
Create figure comparing encounter rates between idaho and sonora. Ask students what the code chunk arguments should be.
When an analysis produces simulations, a random number generator is generally used so that others can replicate your results. For this tutorial, we will do an exercise to learn the importance of using the set.seed command to ensure replicability.
You and the person next to you use the same distribution variables but diff vs. same set.seed. Do you get the same result? What about when you use the same set.seed? Insert set.seeds here.
# Set seed as 125
set.seed(125)
# Draw 1000 numbers
Draw2 <- rnorm(1000, mean = 0, sd = 2) # Summarize Draw1
# Draw 1000 other numbers
Draw1 <- rnorm(1000, mean = 5, sd = 1.5)
summary(Draw2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.2112 -1.4071 -0.1040 -0.1215 1.3155 5.6772
plot(Draw2)
plot(Draw1)
model <- lm(Draw1 ~ Draw2)
plot(model)
# Set seed as 150
set.seed(150)
# Draw 1000 numbers
Draw3 <- rnorm(1000, mean = 0, sd = 2) # Summarize Draw1
# Draw 1000 other numbers
Draw4 <- rnorm(1000, mean = 5, sd = 1.5)
summary(Draw2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.2112 -1.4071 -0.1040 -0.1215 1.3155 5.6772
plot(Draw2)
plot(Draw1)
model1 <- lm(Draw3 ~ Draw4)
plot(model)
# Save sample
save(model, file = "model.RData")
When an analysis is too computationally intensive, knowing how to use the cache functions from knitR is helpful. Caching a code chunk will store the results of its analysis in a cache folder.
To cache a code chunk, use the argument cache = TRUE. This will ensure the chunk will only run if the chunk’s contents or options change.
# Load Sample
load(file = "model.RData")
summary(model)
##
## Call:
## lm(formula = Draw1 ~ Draw2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0751 -0.9833 -0.0453 0.9580 5.1042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.06839 0.04692 108.010 <2e-16 ***
## Draw2 0.01074 0.02326 0.462 0.644
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.481 on 998 degrees of freedom
## Multiple R-squared: 0.0002136, Adjusted R-squared: -0.0007882
## F-statistic: 0.2132 on 1 and 998 DF, p-value: 0.6444
plot(model)
However, subsequent chunks will not be able to access objects created by the chunks or any packages loaded in it (all packages should be loaded in their own chunk anyways though), and the cached chunk will not run if a previous chunk with code it depended on changes. You can solve this last problem with the argument dependson. A chunk with the dependson argument will rerun if the specified chunk is also rerun. To specify a chunk, use a vector of their labels or their number order from the start of the document. For example, dependson = c(2,3) will rerun the cached chunk if the 2nd and 3rd chunks of the document are also rerun. You can also save objects created by a cached chunk to a separate RData file and load that in subsequent chunks with the load command.
Try cache = TRUE on the code chunk from earlier. Knit the document. What happens? Run the document again. What happens?
should we try to do the rdata thing?
Citations of all R packages used to generate this report. Reads and prints citations stored in packages.bib
### Load R package
library("knitcitations")
### Process and print citations in packages.bib Clear all
### bibliography that could be in the cash
cleanbib()
# Set pandoc as the default output option for bib
options(citation_format = "pandoc")
# Read and print bib from file
read.bibtex(file = "packages.bib")
[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.28. 2024. https://github.com/rstudio/rmarkdown.
[2] S. M. Bache and H. Wickham. magrittr: A Forward-Pipe Operator for R. R package version 2.0.3. 2022. https://magrittr.tidyverse.org.
[3] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.12. 2021. https://github.com/cboettig/knitcitations.
[4] J. Cheng, C. Sievert, B. Schloerke, et al. htmltools: Tools for HTML. R package version 0.5.7. 2023. https://github.com/rstudio/htmltools.
[5] R. Francois and D. Hernangómez. bibtex: Bibtex Parser. R package version 0.5.1. 2023. https://github.com/ropensci/bibtex.
[6] G. Grolemund and H. Wickham. “Dates and Times Made Easy with lubridate”. In: Journal of Statistical Software 40.3 (2011), pp. 1-25. https://www.jstatsoft.org/v40/i03/.
[7] K. Müller and H. Wickham. tibble: Simple Data Frames. R package version 3.2.1. 2023. https://tibble.tidyverse.org/.
[8] Y. Qiu. prettydoc: Creating Pretty Documents from R Markdown. R package version 0.4.1. 2021. https://github.com/yixuan/prettydoc.
[9] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2023. https://www.R-project.org/.
[10] K. Ren and K. Russell. formattable: Create Formattable Data Structures. R package version 0.2.1. 2021. https://renkun-ken.github.io/formattable/.
[11] V. Spinu, G. Grolemund, and H. Wickham. lubridate: Make Dealing with Dates a Little Easier. R package version 1.9.3. 2023. https://lubridate.tidyverse.org.
[12] H. Wickham. forcats: Tools for Working with Categorical Variables (Factors). R package version 1.0.0. 2023. https://forcats.tidyverse.org/.
[13] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN: 978-3-319-24277-4. https://ggplot2.tidyverse.org.
[14] H. Wickham. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1. 2023. https://stringr.tidyverse.org.
[15] H. Wickham. tidyverse: Easily Install and Load the Tidyverse. R package version 2.0.0. 2023. https://tidyverse.tidyverse.org.
[16] H. Wickham, M. Averick, J. Bryan, et al. “Welcome to the tidyverse”. In: Journal of Open Source Software 4.43 (2019), p. 1686. DOI: 10.21105/joss.01686.
[17] H. Wickham, J. Bryan, M. Barrett, et al. usethis: Automate Package and Project Setup. R package version 2.2.2. 2023. https://usethis.r-lib.org.
[18] H. Wickham, W. Chang, L. Henry, et al. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.5.1. 2024. https://ggplot2.tidyverse.org.
[19] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.1.4. 2023. https://dplyr.tidyverse.org.
[20] H. Wickham and L. Henry. purrr: Functional Programming Tools. R package version 1.0.2. 2023. https://purrr.tidyverse.org/.
[21] H. Wickham, J. Hester, and J. Bryan. readr: Read Rectangular Text Data. R package version 2.1.5. 2024. https://readr.tidyverse.org.
[22] H. Wickham, J. Hester, W. Chang, et al. devtools: Tools to Make Developing R Packages Easier. R package version 2.4.5. 2022. https://devtools.r-lib.org/.
[23] H. Wickham, D. Vaughan, and M. Girlich. tidyr: Tidy Messy Data. R package version 1.3.1. 2024. https://tidyr.tidyverse.org.
[24] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman and Hall/CRC, 2016. ISBN: 978-1138700109. https://bookdown.org/yihui/bookdown.
[25] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.43. 2025. https://github.com/rstudio/bookdown.
[26] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.
[27] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.
[28] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.44. 2023. https://yihui.org/knitr/.
[29] Y. Xie, J. Allaire, and G. Grolemund. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC, 2018. ISBN: 9781138359338. https://bookdown.org/yihui/rmarkdown.
[30] Y. Xie, C. Dervieux, and E. Riederer. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC, 2020. ISBN: 9780367563837. https://bookdown.org/yihui/rmarkdown-cookbook.
[31] H. Zhu. kableExtra: Construct Complex Table with kable and Pipe Syntax. R package version 1.4.0. 2024. http://haozhu233.github.io/kableExtra/.
Version information about R, the operating system (OS) and attached or R loaded packages. This appendix was generated using sessionInfo().
# Load and provide all packages and versions
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS 15.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Boise
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] devtools_2.4.5 usethis_2.2.2 bibtex_0.5.1
## [4] knitcitations_1.0.12 htmltools_0.5.7 prettydoc_0.4.1
## [7] magrittr_2.0.3 kableExtra_1.4.0 formattable_0.2.1
## [10] bookdown_0.43 rmarkdown_2.28 knitr_1.44
## [13] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
## [16] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [19] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
## [22] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.52 bslib_0.5.1 remotes_2.4.2.1
## [5] htmlwidgets_1.6.4 processx_3.8.2 callr_3.7.3 tzdb_0.5.0
## [9] ps_1.7.5 vctrs_0.6.5 tools_4.3.1 generics_0.1.3
## [13] fansi_1.0.6 RefManageR_1.4.0 pkgconfig_2.0.3 lifecycle_1.0.4
## [17] compiler_4.3.1 textshaping_0.3.6 munsell_0.5.1 codetools_0.2-19
## [21] httpuv_1.6.14 sass_0.4.7 yaml_2.3.7 urlchecker_1.0.1
## [25] crayon_1.5.3 later_1.3.2 pillar_1.9.0 jquerylib_0.1.4
## [29] ellipsis_0.3.2 cachem_1.0.8 sessioninfo_1.2.2 mime_0.12
## [33] tidyselect_1.2.1 digest_0.6.33 stringi_1.8.4 fastmap_1.1.1
## [37] grid_4.3.1 colorspace_2.1-1 cli_3.6.5 pkgbuild_1.4.2
## [41] utf8_1.2.4 withr_3.0.2 prettyunits_1.2.0 promises_1.2.1
## [45] scales_1.3.0 backports_1.4.1 timechange_0.3.0 httr_1.4.7
## [49] hms_1.1.3 memoise_2.0.1 shiny_1.8.0 evaluate_0.22
## [53] miniUI_0.1.1.1 viridisLite_0.4.2 profvis_0.4.0 rlang_1.1.6
## [57] Rcpp_1.0.13 xtable_1.8-4 glue_1.8.0 formatR_1.14
## [61] xml2_1.3.6 pkgload_1.3.3 svglite_2.2.1 rstudioapi_0.15.0
## [65] jsonlite_1.8.9 R6_2.5.1 plyr_1.8.9 systemfonts_1.2.3
## [69] fs_1.6.3